
Data Quality Issues



Does Homophily Help in Robust Test-time Node Classification?

Jiang, Yan, Qiu, Ruihong, Huang, Zi

arXiv.org Artificial Intelligence

Homophily, the tendency of nodes from the same class to connect, is a fundamental property of real-world graphs, underpinning structural and semantic patterns in domains such as citation networks and social networks. Existing methods exploit homophily by designing homophily-aware GNN architectures or graph structure learning strategies, yet they primarily focus on GNN learning with training graphs. In real-world scenarios, however, test graphs often suffer from data quality issues and distribution shifts, such as domain shifts across users from different regions in social networks and temporal evolution shifts in citation graphs collected over varying time periods. These factors significantly compromise a pre-trained model's robustness, resulting in degraded test-time performance. Through empirical observations and theoretical analysis, we reveal that transforming the test graph structure, by increasing homophily in homophilic graphs or decreasing it in heterophilic graphs, can significantly improve the robustness and performance of pre-trained GNNs on node classification, without requiring any model training or updates. Motivated by these insights, we propose GrapHoST, a novel test-time graph structural transformation method grounded in homophily. Specifically, a homophily predictor is developed to discriminate test edges, enabling adaptive test-time graph structural transformation guided by the confidence of predicted homophily scores. Extensive experiments on nine benchmark datasets under a range of test-time data quality issues demonstrate that GrapHoST consistently achieves state-of-the-art performance, with improvements of up to 10.92%. Our code has been released at https://github.com/YanJiangJerry/GrapHoST.
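The abstract describes the mechanism only at a high level, so the following is a minimal NumPy illustration of the underlying principle rather than GrapHoST itself: measure edge homophily, then drop test edges whose predicted homophily scores confidently disagree with the graph's regime. The stand-in predictor output, thresholds, and function names are all illustrative assumptions.

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a class label."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

def transform_graph(edge_index, homophily_scores, graph_is_homophilic,
                    low=0.1, high=0.9):
    """Drop test edges whose predicted homophily confidently disagrees
    with the graph's regime; low-confidence edges are left untouched."""
    if graph_is_homophilic:
        keep = homophily_scores > low    # remove confidently heterophilic edges
    else:
        keep = homophily_scores < high   # remove confidently homophilic edges
    return edge_index[:, keep]

# Toy test graph: 6 nodes, 2 classes, one noisy cross-class edge.
labels = np.array([0, 0, 0, 1, 1, 1])
edge_index = np.array([[0, 1, 2, 0, 3, 4],
                       [1, 2, 0, 3, 4, 5]])
scores = np.array([0.95, 0.90, 0.92, 0.05, 0.88, 0.91])  # stand-in predictor output

print(edge_homophily(edge_index, labels))                                  # ~0.833
print(edge_homophily(transform_graph(edge_index, scores, True), labels))   # 1.0
```

Pruning the single confidently heterophilic edge raises the toy graph's homophily from 0.83 to 1.0 without touching the model, which mirrors the training-free transformation the abstract argues for.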



CleanPatrick: A Benchmark for Image Data Cleaning

Gröger, Fabian, Lionetti, Simone, Gottfrois, Philippe, Gonzalez-Jimenez, Alvaro, Amruthalingam, Ludovic, Goessinger, Elisabeth Victoria, Lindemann, Hanna, Bargiela, Marie, Hofbauer, Marie, Badri, Omar, Tschandl, Philipp, Koochek, Arash, Groh, Matthew, Navarini, Alexander A., Pouly, Marc

arXiv.org Artificial Intelligence

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
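Since CleanPatrick scores issue detection as a ranking problem, the evaluation pattern can be illustrated independently of the benchmark. The sketch below uses synthetic stand-in labels and scores (the benchmark's exact metric suite is an assumption here) to compute average precision and precision under a fixed review budget.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def precision_at_k(y_true, scores, k):
    """Precision among the k highest-scoring samples (the review budget)."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))

# Synthetic stand-ins: binary issue labels and detector scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)      # 1 = confirmed issue
scores = y_true * 0.5 + rng.random(1000)    # a detector better than chance

print("AP:   ", average_precision_score(y_true, scores))
print("P@100:", precision_at_k(y_true, scores, k=100))
```

Precision at a fixed k mirrors the "constrained review budget" setting the abstract mentions: an auditor only ever inspects the top of the ranking.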


Formative Study for AI-assisted Data Visualization

Saber, Rania, Fariha, Anna

arXiv.org Artificial Intelligence

This formative study investigates the impact of data quality on AI-assisted data visualization, focusing on how uncleaned datasets influence the outcomes of these tools. By generating visualizations from datasets with inherent quality issues, the research aims to identify and categorize the specific visualization problems that arise. The study further explores potential methods and tools to address these visualization challenges efficiently and effectively. Although tool development has not yet been undertaken, the findings emphasize the need to enhance AI visualization tools to handle flawed data better. This research underscores the critical need for more robust, user-friendly solutions that make correcting data and visualization errors quicker and easier, thereby improving the overall reliability and usability of AI-assisted data visualization.


Towards Reliable Dermatology Evaluation Benchmarks

Gröger, Fabian, Lionetti, Simone, Gottfrois, Philippe, Gonzalez-Jimenez, Alvaro, Groh, Matthew, Daneshjou, Roxana, Consortium, Labelling, Navarini, Alexander A., Pouly, Marc

arXiv.org Artificial Intelligence

Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation. The protocol leverages an existing algorithmic cleaning strategy and is followed by a confirmation process terminated by an intuitive stopping criterion. Based on confirmation by multiple dermatologists, we remove irrelevant samples and near-duplicates, and we estimate the percentage of label errors in six dermatology image datasets promoted for model evaluation by the International Skin Imaging Collaboration. Along with this paper, we publish revised file lists for each dataset, which should be used for model evaluation. Our work paves the way for more trustworthy performance assessment in digital dermatology.
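The abstract does not spell out the stopping criterion, so the sketch below assumes a simple patience rule: experts review algorithmically ranked candidates and stop after a fixed run of consecutive rejections. The `is_issue` callable stands in for dermatologist confirmation and is hypothetical.

```python
def confirm_ranked_candidates(candidates, is_issue, patience=20):
    """Review candidates in descending issue-score order; stop once
    `patience` consecutive candidates have been rejected."""
    confirmed, misses = [], 0
    for sample in candidates:
        if is_issue(sample):        # expert confirmation (hypothetical hook)
            confirmed.append(sample)
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break
    return confirmed
```

A patience rule of this kind keeps review effort roughly proportional to how many true issues sit near the top of the ranking, which is one way to realize the resource efficiency the paper targets.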


Quality Issues in Machine Learning Software Systems

Côté, Pierre-Olivier, Nikanjam, Amin, Bouchoucha, Rached, Basta, Ilan, Abidi, Mouna, Khomh, Foutse

arXiv.org Artificial Intelligence

Context: An increasing demand is observed across domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Problem: There is a strong need to ensure the serving quality of MLSSs. False or poor decisions by such systems can lead to the malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and is currently a hot research topic. Objective: This empirical study investigates the characteristics of real quality issues in MLSSs from the viewpoint of practitioners and aims to identify a catalog of such issues. Method: We conduct a set of interviews with practitioners and experts to gather insights about their experience and practices when dealing with quality issues, and we validate the identified issues via a survey of ML practitioners. Results: Based on the content of 37 interviews, we identified 18 recurring quality issues and 24 strategies to mitigate them. For each identified issue, we describe its causes and consequences according to the practitioners' experience. Conclusion: We believe the catalog of issues developed in this study will allow the community to develop efficient quality assurance tools for ML models and MLSSs. A replication package of our study is available in our public GitHub repository.


Is Your Data Quality Enough to Support Machine Learning/AI Plans?

#artificialintelligence

AI is a priority for governments and businesses worldwide, yet poor data quality is a key aspect of AI that has been overlooked. AI algorithms depend on reliable data to produce optimal results; if the data is incomplete, incorrect, or insufficient, the consequences can be devastating. In AI systems that identify patients' diseases, for example, poor data quality can produce inaccurate diagnoses and predictions, leading to misdiagnosis and delayed treatment.


Interactive data prep widget for notebooks powered by Amazon SageMaker Data Wrangler

#artificialintelligence

According to a 2020 survey of data scientists conducted by Anaconda, data preparation is one of the most critical steps in machine learning (ML) and data analytics workflows, and often very time-consuming. Data scientists spend about 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and visualizing (21%) data. Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. With a single click, data scientists and developers can spin up Studio notebooks to explore datasets and build models. If you prefer a GUI-based, interactive interface, you can use Amazon SageMaker Data Wrangler, which offers over 300 built-in visualizations, analyses, and transformations to efficiently process data backed by Spark without writing a single line of code.
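As a concrete illustration of the notebook workflow the post describes: to the best of my recollection the widget is activated by installing the sagemaker-datawrangler package and simply displaying a pandas DataFrame in a Studio notebook, but treat the package name and behavior as assumptions here, and the S3 path as a placeholder.

```python
# Run inside a SageMaker Studio notebook; assumes the widget package is
# available, e.g. via: %pip install sagemaker-datawrangler
import pandas as pd
import sagemaker_datawrangler  # noqa: F401 -- importing activates the widget

# Placeholder path: substitute your own dataset location.
df = pd.read_csv("s3://example-bucket/raw/customers.csv")

# Displaying the DataFrame renders the interactive prep widget, which
# suggests transforms (drop duplicates, fill missing values, etc.) and can
# export the applied steps back as reusable pandas code.
df
```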


Council Post: Data Quality Is Also An AI Problem

#artificialintelligence

Emanuel Younanzadeh is VP Marketing at The Modern Data Company. Artificial intelligence (AI) continues its rise to prominence in the business world. The number of companies using AI and the range of problems AI is being applied to are both increasing steadily. However, one issue plagues AI just as much as it has plagued analytics of all kinds over the years: data quality. Organizations put tremendous resources behind ensuring the quality of their data.